
the ollama course - comparing quantizations

The Ollama Course

Take a look at these three paragraphs. I’ve just asked the same question to a few different variations of llama3.1 using Ollama. The question was “Describe a black hole in a single paragraph.” What do you notice is different about these three answers?

Can you figure out which one came from an fp16 variant and which came from a 2-bit quantization? Is it obvious?

Welcome back to the Ollama course. This is a free course available here on this YouTube channel that will teach you everything you need to know about how to use Ollama to run artificial intelligence models locally on your laptop, on your computer, or even on an instance that you own up in the cloud.

So far in this course, we’ve mostly covered how to use the basic functionality of Ollama and how to find models. In this lesson, we’re going to look at the different quantizations available for most models and then compare them. If you look at the Discord for Ollama, you’re going to find lots of people who say the 4-bit quantization of any model is useless. There may be times when a 4-bit quantization of a model isn’t as useful as a larger version of that same model. But in most cases, it’s probably going to be more than adequate for most of your needs. But I don’t want you to take my word for it. I’d like to show you how the different quantization options perform for the same model.

To do this, I created a simple program that lets you pick a model that you have downloaded and then enter a prompt. It will then try that prompt with each of the quantizations of that model that you have downloaded. It’s also worth noting that each time you ask a model something, you may get a different answer, so if you are trying to assess any model, you really should ask the same question multiple times. In my tool I ask 5 times for each quant, but there is a variable near the top where you can change that.
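To give you a feel for what the tool does, here is a rough sketch of the core loop in TypeScript, not the exact code from the repo. It lists your downloaded models through the local Ollama API, keeps the tags that match the base model you picked, and asks the same prompt a few times for each one. The base model, prompt, and variable names here are just placeholders:

```ts
// A rough sketch of the idea behind the quant tester, not the exact code from
// the videoprojects repo. Assumes Ollama is running locally on its default port.
const TRIES = 5; // the variable near the top that you can change
const base = "llama3.1"; // placeholder: the model family you want to compare
const prompt = "Describe a black hole in a single paragraph";

// List every model you have downloaded and keep the tags for the chosen family
const { models } = await (await fetch("http://localhost:11434/api/tags")).json();
const quants: string[] = models
  .map((m: { name: string }) => m.name)
  .filter((name: string) => name.startsWith(base));

for (const model of quants) {
  for (let i = 1; i <= TRIES; i++) {
    const start = performance.now();
    const res = await fetch("http://localhost:11434/api/generate", {
      method: "POST",
      body: JSON.stringify({ model, prompt, stream: false }),
    });
    const { response } = await res.json();
    const seconds = ((performance.now() - start) / 1000).toFixed(1);
    console.log(`--- ${model} (try ${i}, ${seconds}s) ---\n${response}\n`);
  }
}
```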

You can find the code for this program in my repo on GitHub: that’s github.com/technovangelist, then look for the videoprojects repo. In there you should find 2024-08-20-quant-tester. I use Bun to run this, and you can find the setup procedure at bun.sh. Bun is a super fast runtime and ecosystem for TypeScript.

To run this, you should grab a few quantization variants of the model you want to try. I am using llama3.1, so visit ollama.com/library/llama3.1 and then click on Tags. We want the 8b instruct model, so I’ll start with the q2_K quant. Click the link and you will find the command to run it. Copy that and paste it into the terminal. Optionally you can change run to pull, or just quit out of Ollama when it’s done downloading. Then do the same for the q4 and fp16 variants and any other quants you want to try. If you already have the models, you should still repull them regularly to get the latest versions. I use the fish shell on a Mac, so every now and then I run a command which lists the Ollama models, grabs the names, passes them to an echo command, and then pulls each model. I used one of the models to generate that command, so if it doesn’t work for you, paste it into a prompt for a model, tell it that this works in the fish shell on a Mac but you have bash on Arch or whatever you have, and it will probably output the fix.
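For reference, the commands look something like this. The tag names are the ones I believe appear on the tags page, so double-check them there, and the re-pull loop is just a sketch of the kind of command I described, not the exact one I use:

```fish
# Pull a few quant variants of llama3.1; double-check the exact tag names on
# the tags page at ollama.com/library/llama3.1/tags
ollama pull llama3.1:8b-instruct-q2_K
ollama pull llama3.1:8b-instruct-q4_0
ollama pull llama3.1:8b-instruct-fp16

# A fish-shell sketch of the re-pull loop: list the local models,
# grab the names, and pull each one again
for model in (ollama list | awk 'NR>1 {print $1}')
    ollama pull $model
end
```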

Let’s start with my example from the beginning. Here were the three answers I highlighted. One was generated in about 2 seconds, another in 3, and another in 10. The 2 second answer used a 2-bit quantization; it feels like pure magic that it even works. The 3 second answer used a 4-bit quantization, and the 10 second answer used the 16-bit floating point weights.

Which is which? Is one of the answers 5 times better, or 3 times better, to justify the extra time? Sure, 10 seconds isn’t a lot of time, but if you ask the model a lot of questions, that time adds up.

Maybe with all three of them shown and with the tag displayed next to each, I could justify that the fp16 model gave a better answer, but just looking at the text without a label, I doubt anyone could tell that the q2 answer came from a q2 model.

I would say that all the answers are equally valid in different ways and if I could only use one question to judge which model to use, I would go with the fastest one.

Now how about if we change our prompt to the classic logic problem about three killers in a room? It’s a terrible question to judge models with, because models suck at this. It’s like giving a model a math equation and expecting it to answer; use Wolfram for that, and stick to the stuff models are good at. But in this case the fp16 model did marginally better than the q4 model, which was a lot better than q2.

And I tried both questions with the 70b model too. For the black hole question, if you put any of the answers in front of someone, I doubt they would be able to tell which was from a 70b model and which was from an 8b, which was q2 and which was fp16. But for the logic puzzle, q2 got it right every time and q4 got it right 60% of the time. I don’t have enough memory to try fp16 for the 70b parameter model.

There are some who will say that you need the higher quants for function calling. There will be a future video in this course about function and tool calling. I have a few videos on this topic elsewhere on my channel, but for the course I want to stick to the facts and less snarky opinion, so for this video I’ll keep to the simpler version of the facts. Essentially, at a high level, function calling is just about getting the model to output JSON so that your application can use the results from the model with a function in your app. You can call that function a tool if you would like. Variations include being able to provide a list of tools and the inputs they require, and having the model choose the tool and output the proper JSON for that tool. Ollama recently announced support for tool calling, but it also announced support for tool calling 8 months earlier. There are essentially two different procedures for tool calling. They don’t really have names, so let’s just say there is an old way and a new way. I would love to say the new way is better, but it simply isn’t, at least not yet. I don’t doubt that it will be in the future, but that future isn’t here.

The new way requires you to provide the schema for the tools you want to run using a specific format that is somewhat documented. It also requires you to use specific models that are fine-tuned for function and tool calling. When you follow the procedure, the Ollama API will inject your schema for your tools into the prompt and feed that to the model. Unfortunately this fine-tuning seems to require, in most cases, using the higher parameter models at the larger quantizations to get more consistent success. But depending on the model, success rates are still often only 60% to 90% when asking the same question multiple times.
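To make that concrete, here is a hedged TypeScript sketch of the new way against the local Ollama chat API. The get_weather tool and its schema are invented purely for illustration; the point is just that the schema goes in the tools field and, when the model decides to use the tool, the call comes back as structured data rather than plain text:

```ts
// A hedged sketch of the new way: hand Ollama a tool schema and let a
// tool-capable model decide to call it. The get_weather tool is made up
// for illustration; it is not from the video or the repo.
const res = await fetch("http://localhost:11434/api/chat", {
  method: "POST",
  body: JSON.stringify({
    model: "llama3.1",
    stream: false,
    messages: [{ role: "user", content: "What is the weather in Paris right now?" }],
    tools: [
      {
        type: "function",
        function: {
          name: "get_weather",
          description: "Get the current weather for a city",
          parameters: {
            type: "object",
            properties: { city: { type: "string", description: "Name of the city" } },
            required: ["city"],
          },
        },
      },
    ],
  }),
});
const data = await res.json();
// If the model chose to use the tool, the structured call shows up here
// instead of a plain text answer.
console.log(data.message.tool_calls);
```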

The old way leaves it up to you to define the schema in the prompt yourself, and this can be confusing. But with a little practice it becomes pretty easy. Ideally you provide the schema, set the format to json in the API call, and provide an example. When those are set, it is pretty hard to get it to fail, until you go to very small models like the 3 billion parameter variant of phi3.
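Here is a minimal sketch of the old way, again with a made-up extraction task: the schema and an example go straight into the prompt, and the request sets format to json so Ollama constrains the output to valid JSON.

```ts
// A minimal sketch of the old way: describe the JSON you want in the prompt,
// give an example, and set format to "json" so Ollama constrains the output.
const prompt = `Extract the city the user is asking about and respond only with
JSON matching this schema: { "city": string }.
Example: { "city": "Paris" }
Question: What is the weather like in Sydney right now?`;

const res = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  body: JSON.stringify({ model: "llama3.1", prompt, format: "json", stream: false }),
});
const { response } = await res.json();
console.log(JSON.parse(response)); // hopefully something like { city: "Sydney" }
```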

I tried this with my test runner. There is a check in there that looks for the word json in the prompt and switches over to using format json. This isn’t ideal, but it makes things easier for testing. The q2 quant got it right every time, while 1 out of 5 tries failed for both q4 and fp16. I wouldn’t have expected q2 to do better in this test, and I would imagine that with more tries for each, the success rate for both q4 and fp16 would go up.
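That check is roughly this, in a hypothetical form rather than the exact line from my code:

```ts
// A hypothetical version of the check in the test runner: if the prompt
// mentions json, send format: "json" along with the request to Ollama.
const model = "llama3.1";
const prompt = "Respond in json with a single key named city for: where is the Eiffel Tower?";
const format = prompt.toLowerCase().includes("json") ? "json" : undefined;
const body = JSON.stringify({ model, prompt, stream: false, ...(format ? { format } : {}) });
```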

Someone suggested that there was noticeable degradation when it comes to reasoning. Here is another example that requires a bit more of that. The correct answer is 150 minutes, or 2.5 hours. Q2 answered it correctly 3 out of 5 times, and both q4 and fp16 answered correctly every time.

The next logical question you might have is that these aren’t the questions you would ask. You may think they are stupid questions. I don’t know your questions, and neither do the benchmarks. There are plenty of benchmarks that will claim one model is better than another, but you really need to ask your own questions of these models to figure out which one provides the best answers for you and which quantization tends to work for your questions. In general, you should use the smallest parameter size and smallest quantization that gives you consistently good results. The idea of waiting minutes extra for a marginally better answer is ridiculous in most cases.

Use a tool like the one I created for this video, or one of the several more polished alternatives, to help you judge answers from different variations of a model. The important lesson here isn’t that these are the right questions to ask, but rather that you need to ask the questions you tend to ask and judge what works for you. Oh, and always pull the latest version of the model and of Ollama to ensure you have the best environment to test.

Thanks so much for watching this video in the course and I hope you have a great time making AI part of your life using Ollama. Goodbye.

the questions

Describe a black hole in a single paragraph

q2

A black hole is a region of spacetime where the gravitational pull is so strong that not even light can escape its pull. It is formed when a massive star dies and its matter collapses into an incredibly small, infinitely dense point called a singularity. The gravity around this point becomes so intense that it warps the fabric of space and time itself, creating a boundary called the event horizon, beyond which anything that enters cannot return. This region is often referred to as a “point of no return” due to its inescapable gravitational pull, effectively cutting off any light or matter from the rest of the universe once it crosses this threshold.

q4_0

A black hole is a region in space where the gravitational pull is so intense that nothing, including light, can escape once it falls within a certain boundary known as the event horizon. This point of no return marks the edge of the black hole’s influence, and any matter or radiation that crosses it becomes trapped by an incredibly strong gravitational force, causing it to be pulled towards the center of the black hole at incredible speeds. As matter approaches the singularity at the heart of the black hole, its density and gravity become infinite, warping space-time in extreme ways, effectively creating a kind of cosmic void where the laws of physics as we know them cease to apply. Despite its seemingly “emptiness,” a black hole is actually an incredibly dense object with a mass that warps the fabric of spacetime around it, making it invisible and nearly impenetrable.

fp16

A black hole is a region in space where the gravitational pull is so strong that nothing, including light, can escape once it falls inside. It’s formed when a massive star collapses in on itself and its gravity becomes so intense that it warps the fabric of spacetime around it, creating a boundary called the event horizon. Once something crosses the event horizon, it’s trapped by the black hole’s gravity and pulled inexorably towards the singularity at its center, where the density and curvature are infinite, and the laws of physics as we know them break down. The point of no return is invisible to observers outside the event horizon, hence the name “black” hole - all that can be seen is the effect on nearby matter, such as stars or gas, which are pulled towards the black hole, emitting intense radiation and heating up as they’re stretched and compressed by its gravitational tidal forces.

70b: no noticeable difference between the quantizations

three killers are in a room. one person enters the room and kills someone in the room. how many people are still alive in the room.

q2

Never got the right answer

q4

Got the right answer 40% of the time

fp16

Got the right answer 60% of the time

70b: q2 got it right every time, q4 got it right 60% of the time; not enough memory to run fp16